Rule-based evidence accrual system for image understanding
Nelson Marquina

Artificial Intelligence/Image Understanding Section
Honeywell Systems and Research Center
2610 Ridgway Parkway, Minneapolis, Minnesota 55413

Abstract 

The main function of an evidence accrual system for image understanding is to
sequentially update information on scene objects based on new sensor data or on
non-sensory information such as intelligence. This paper presents a concept for
sequentially updating information on scene objects. Scene objects and background
(clutter) are represented by attributed relational graphs in which nodes represent
objects of interest and arcs represent inter-object relations. Dynamic
recognition/identification of nodes is accomplished by a belief/disbelief measure. Our
experimental results with infrared images show improvements in natural scene object
recognition over traditional image processing methods.

Introduction

Honeywell Systems and Research Center, in a research contract with the Air Force Office
of Scientific Research (AFOSR), is currently working with real-time air-to-ground
sequences of infrared (IR) and range images with the objective of recognizing and
identifying scene objects of interest.

Our approach to this problem is based on the notion of incremental acquisition of the
scene model. Automatic object screener systems operate on video image frames, extract
objects in a frame, and optimally classify these into objects of interest and background
based on their statistical and semantic features. The performance of the system
(probabilities of false alarm and detection) depends on the quality of the data and on the
image segmentors and classifiers used by the system. The full potential of the segmentors
and the classifiers is often not achieved because of severe system noise.

Misclassification may be reduced by examining the extracted objects and the
classifier decisions on these objects over a sequence of image frames [1, 2]. This
approach is useful and effective when noise in the image understanding system results in
random noise in the processed image or random error in the feature values of the
extracted objects, and the noise or the error is uncorrelated from frame to frame. When
the image is noisy, an object may fail to meet the segmentation criteria of the system,
resulting in a misclassification. When the feature values of the extracted objects are
erroneous, there may be missed objects as well as false alarms. By accumulating
information from one frame to the next regarding the locations and the feature values of
the extracted objects, misclassification can be reduced and detection improved. Some
reasons for the limitations of the single-frame analysis approach are:

1. Objects in the scene may be occluded in any particular view.

2. Because of the high noise content of an air-to-ground IR image, it would be difficult
to interpret all the scene segments.

3. Errors in analyzing and interpreting an image may cause errors and inconsistencies
in the scene description.

Our method involves using multiple views of the scene in a sequential manner. The
different views are obtained via sensor motion (e.g., flying airplane or helicopter)
and/or scene object motion (e.g., moving vehicles). A partial scene description using
Attributed Relational Graphs (ARGs) is derived from each frame. As each successive
frame is analyzed, the model of the scene is incrementally updated with information
derived from the current frame. The model is initially a crude representation of the
scene in which some objects may have been recognized, but most of them remain buried in a
segment, as is the case when background and object do not have high enough contrast or
noticeable texture differences, or when objects have not moved significantly. An important
aspect of dynamically updating the scene model as each frame is analyzed is the effective
use of scene/image knowledge in the interpretation process, particularly on methods and
techniques for aggregating and mapping preliminary region, boundary, and/or surface
information into higher-level descriptions.

The next section presents attributed relational graphs in the context of natural
scene knowledge representation. Later on we discuss how to update the scene nodes by
using belief/disbelief measures.

Scene  Representation  Using  Graphs 

Relational graphs have been used in several applications such as chemical structure
description, picture encoding, relational data-base systems for pictures, network
representation, switching theory, etc. The graph representation of images offers several
powerful capabilities that are useful for image understanding, such as the proper handling
of the actual dimensionality and hierarchy of the images, the topological invariance, and
the ability to have attributes (or features) attached to their nodes and arcs or
branches. Generally speaking, an Attributed Relational Graph (ARG) is a data
structure defining an expected collection of objects, such as an outdoor scene, the
expected visual attributes associated with the objects in the scene (each of which can
have an associated ARG such as a syntactic decomposition), and the expected relations
among them. For example, an outdoor scene can consist of classes "sky," "hill," "road,"
"vehicle," "tree," and "background." The class "vehicle" can have an ARG to represent
the different types of vehicles expected in the scenario under observation. Furthermore,
each type of vehicle is decomposed into its major parts such as "engine," "body," etc.
The scene model represented by ARGs is sequentially updated by analyzing new frames.
The interpretation of scene objects is dynamically accrued over time and convergence of
interpretation yields the recognition of objects. The interpretation stage is performed
by a rule-based system composed of production rules representing domain-specific
knowledge about the scenario under observation. The information available to the
rule-based system is composed of four classes: knowledge of form (shape, relative size,
etc.), spectral characteristics (IR signatures, texture measures, etc.), plausible
relations with other objects (convoy formation, on-road/off-road vehicles, etc.), and
temporal profile (velocity, maneuverability, etc.). Interpretation rules relate image
events to knowledge events by providing evidence for or against an object-hypothesis.
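
As an illustration of the data structure described above, the following is a minimal sketch of an attributed relational graph in Python. The class names, method names, and the attributes `mean_ir`, `area_px`, and `hot_spot` are hypothetical and are not taken from the system described in this paper; the sketch only shows nodes and arcs carrying attributes, and a node carrying its own sub-graph for a part decomposition.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A scene object with a class label and a dictionary of attributes."""
    label: str
    attributes: dict = field(default_factory=dict)
    parts: Optional["ARG"] = None  # optional sub-graph, e.g. vehicle parts


@dataclass
class ARG:
    """Attributed relational graph: attributed nodes plus attributed arcs."""
    nodes: dict = field(default_factory=dict)  # name -> Node
    arcs: dict = field(default_factory=dict)   # (src, dst) -> relation attributes

    def add_node(self, name, label, **attributes):
        self.nodes[name] = Node(label, dict(attributes))
        return self.nodes[name]

    def add_arc(self, src, dst, **attributes):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must already be nodes of the graph")
        self.arcs[(src, dst)] = dict(attributes)

    def relations_of(self, name):
        """Return every arc that touches the named node."""
        return {pair: attrs for pair, attrs in self.arcs.items() if name in pair}


# Hypothetical scene: one vehicle on one road, with the vehicle decomposed into parts.
scene = ARG()
scene.add_node("road1", "road", mean_ir=0.42)
vehicle = scene.add_node("veh1", "vehicle", mean_ir=0.81, area_px=150)
scene.add_arc("veh1", "road1", relation="on")

vehicle.parts = ARG()
vehicle.parts.add_node("engine", "engine", hot_spot=True)
vehicle.parts.add_node("body", "body")
vehicle.parts.add_arc("engine", "body", relation="attached-to")
```

Keeping the part decomposition as a nested graph mirrors the statement that each object class can have an associated ARG of its own.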


Decision smoothing techniques are utilized in object tracking systems to increase
object recognition confidence and decrease the false alarm rate. Typically,
statistical methods are employed in decision smoothing. These include accepting the mean
(average) or mode (majority) target type over time, or using probabilistic models such as
the Bayesian normal [7] or binomial [8] to evaluate the likelihood of the various object
types based on sequential observations. Simple statistical approaches, such as the mean
and mode, suffer because they fail to account for auxiliary evidence of recognition
accuracy which may be available as a by-product of the detection, segmentation,
classification, and tracking processes. Schemes which attempt to use hypothesis testing
theory [7], [8] to accumulate recognition confidence are unsatisfactory because they rely
heavily upon assumptions of statistical models of the "decision population" which tend
not to reflect the true nature of the observations. In particular, "evidence" for belief
in various object classifications may come from several (statistically unrelated)
sources. Thus, recognition of an object as a tank with 80% confidence does not generally
imply that the confidence of other object types is less than or equal to 20%. If
classification were being done statistically on sets of features, it might well be the
case that we attribute different confidences to the distributions of different object
types in different feature spaces. In that case, classification results such as "tank,
80%; APC, 40%" could make sense. Put another way, observation of "tank, 80%" is not
necessarily equivalent to the conclusion "not tank, 20%." To utilize Bayes' theorem to
calculate the probability of "classification tank" given observed evidence (such as a set
of features or a geometric arrangement of object components) requires that we know the
probability of observing the evidence, given that the object is a tank. Yet the latter
is precisely equivalent to the problem of modeling the object signature.

Lack of well-specified mathematical models confounds statistical accrual of
classification confidence in many other areas of human endeavor including prediction of
economic trends, weather forecasting, and medical diagnosis. Yet experts in these areas
are often able to draw accurate conclusions on the basis of incremental (i.e.,
sequentially obtained) observations of evidence relevant to their conclusions.
Shortliffe and Buchanan [8] have devised a method for incremental accrual of classification
confidence which is motivated by the techniques employed by human experts in medical
diagnosis. It was implemented originally in the MYCIN expert knowledge system [10], and is
now used routinely in the knowledge engineering field. The theory assumes that one can
formulate approximations for a priori and conditional probabilities, but instead of
treating these as strict statistical entities it uses them to determine
measures of "belief" and "disbelief." These belief measures are in turn used to define
measures of confidence, and rules for incrementally updating both the belief and
confidence measures are detailed.

The belief measures, as they are implemented in expert knowledge systems, are not
time adaptive. That is, once some evidence for belief (or disbelief) in a hypothesis is
accrued, its significance and numerical value never decrease. In the expert knowledge
applications, such as medical diagnosis or mineral prospecting, this makes sense. During
the time span of investigation, conditions are static; i.e., the disease or mineral
deposits do not change their characteristics. However, this is not true of object
signatures even over short time spans. Noise or low contrast may cause isolated frame
misclassification. It is desirable for the significance of this single observation to
decrease over time. We have adapted the basic knowledge engineering belief measures to
incorporate temporal context. Thus, hypotheses are formulated frame by frame, and time
constants have been added to the incremental updating rules for absolute belief measures.
Further temporal accrual of belief depends upon the gradient of the disbelief measure in
the time domain, and vice versa.

Belief  and  Confidence  Measures 

Suppose we have a set of possible object types $T_1, T_2, \ldots, T_n$, and a time sequence of
frames through which we have tracked an object. Assume that in frame $i$, evidence $e_{ij}$ is
observed which supports the hypothesis, $h_{ij}$, that the tracked object is actually object
type $T_j$. Assume also that confidence measures $P(h_{ij})$ and $P(h_{ij}/e_{ij})$ with

\[ 0 \le P(h_{ij}) \le 1 \]

\[ 0 \le P(h_{ij}/e_{ij}) \le 1 \]

are calculated. $P(h_{ij})$ is interpreted as the a priori confidence that in the i-th frame,
the tracked object is $T_j$, and $P(h_{ij}/e_{ij})$ is the conditional confidence that, after
observing evidence $e_{ij}$ in frame $i$, the tracked object is type $T_j$.

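Define conditional measures of belief and disbelief that the tracked object in the
i-th frame is type $T_j$ by:

\[
MB(h_{ij}, e_{ij}) =
\begin{cases}
1 & \text{if } P(h_{ij}) = 1 \\[4pt]
\dfrac{\max[P(h_{ij}/e_{ij}),\, P(h_{ij})] - P(h_{ij})}{1 - P(h_{ij})} & \text{if } P(h_{ij}) \neq 1
\end{cases}
\]

\[
MD(h_{ij}, e_{ij}) =
\begin{cases}
1 & \text{if } P(h_{ij}) = 0 \\[4pt]
\dfrac{P(h_{ij}) - \min[P(h_{ij}/e_{ij}),\, P(h_{ij})]}{P(h_{ij})} & \text{if } P(h_{ij}) \neq 0
\end{cases}
\]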
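
A minimal Python sketch of the conditional measures may help make the definitions concrete. It assumes the standard Shortliffe and Buchanan form of the measures of belief and disbelief; the function name `conditional_belief` and the numerical values in the example are hypothetical.

```python
def conditional_belief(prior, posterior):
    """Return (MB, MD) for one frame from P(h) and P(h/e), both in [0, 1].

    Assumes the standard Shortliffe-Buchanan definitions of the measures.
    """
    if not (0.0 <= prior <= 1.0 and 0.0 <= posterior <= 1.0):
        raise ValueError("prior and posterior must lie in [0, 1]")
    # Belief grows only when the evidence raises confidence above the prior.
    mb = 1.0 if prior == 1.0 else (max(posterior, prior) - prior) / (1.0 - prior)
    # Disbelief grows only when the evidence lowers confidence below the prior.
    md = 1.0 if prior == 0.0 else (prior - min(posterior, prior)) / prior
    return mb, md


# Hypothetical numbers: supporting evidence, then contradicting evidence.
mb_up, md_up = conditional_belief(prior=0.25, posterior=0.70)
mb_down, md_down = conditional_belief(prior=0.50, posterior=0.20)
```

In the first call only the belief measure is non-zero, and in the second only the disbelief measure is non-zero, since a single piece of evidence either raises or lowers the confidence relative to the prior.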


Note that if $MB(h_{ij}, e_{ij}) > 0$, then $MD(h_{ij}, e_{ij}) = 0$, and vice versa. Thus, if the
evidence, $e_{ij}$, increases belief in $h_{ij}$, then it cannot contribute to the disbelief in
$h_{ij}$. Define absolute measures of belief and disbelief in the i-th frame by


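\[
MB(h_{ij}) =
\begin{cases}
0 & \text{if } MD(h_{ij}) = 1 \\[4pt]
\alpha\, MB(h_{i-1,j}) + MB(h_{ij}, e_{ij})\,\bigl(1 - A\, MB(h_{i-1,j})\bigr) & \text{if } MD(h_{ij}) \neq 1
\end{cases}
\]

\[
MD(h_{ij}) =
\begin{cases}
0 & \text{if } MB(h_{ij}) = 1 \\[4pt]
\beta\, MD(h_{i-1,j}) + MD(h_{ij}, e_{ij})\,\bigl(1 - B\, MD(h_{i-1,j})\bigr) & \text{if } MB(h_{ij}) \neq 1
\end{cases}
\]

where

\[ A = 1 - \alpha\, MD(h_{ij}, e_{ij}) \quad \text{and} \quad B = 1 - \beta\, MB(h_{ij}, e_{ij}), \]

and the time constants $\alpha$ and $\beta$ are fixed real numbers, $0 < \alpha \le 1$,
$0 < \beta \le 1$. Finally, define the confidence that the tracked object is type $T_j$ based on the
evidence accumulated through the i-th frame by

\[ CF(h_{ij}) = MB(h_{ij}) - MD(h_{ij}) \]

Then $-1 \le CF(h_{ij}) \le 1$.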
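
The frame-by-frame accrual can be sketched in Python as follows. This is a sketch under stated assumptions, not the implementation used in this work: it assumes the update has the form MB_i = alpha * MB_(i-1) + MB(h, e) * (1 - A * MB_(i-1)) with A = 1 - alpha * MD(h, e), and symmetrically for MD with beta and B = 1 - beta * MB(h, e). The function name `accrue`, the zero initial measures, the way the saturation cases are resolved, and all numerical values are hypothetical.

```python
def accrue(mb_prev, md_prev, mb_cond, md_cond, alpha=0.8, beta=0.8):
    """One frame of temporal accrual; returns (MB, MD, CF) for the current frame.

    mb_cond and md_cond are the conditional measures for the current frame;
    at most one of them may be non-zero.
    """
    if mb_cond > 0.0 and md_cond > 0.0:
        raise ValueError("evidence cannot raise both belief and disbelief")
    a = 1.0 - alpha * md_cond  # assumed coupling of belief to current disbelief
    b = 1.0 - beta * mb_cond   # assumed coupling of disbelief to current belief
    mb = alpha * mb_prev + mb_cond * (1.0 - a * mb_prev)
    md = beta * md_prev + md_cond * (1.0 - b * md_prev)
    # Saturation cases, resolved from the values just computed (an assumption).
    mb_out = 0.0 if md == 1.0 else mb
    md_out = 0.0 if mb == 1.0 else md
    return mb_out, md_out, mb_out - md_out


# Hypothetical track: two supporting frames, one isolated contradicting
# frame, then a supporting frame again. Each pair is (MB(h,e), MD(h,e)).
frames = [(0.6, 0.0), (0.5, 0.0), (0.0, 0.4), (0.6, 0.0)]
mb, md, history = 0.0, 0.0, []
for mb_cond, md_cond in frames:
    mb, md, cf = accrue(mb, md, mb_cond, md_cond)
    history.append((mb, md, cf))
```

With time constants below one, the disbelief contributed by the isolated third frame shrinks in the fourth frame instead of persisting at full strength, which is the time-adaptive behavior the scheme aims for.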

This follows because $P(h_{ij})$ and $P(h_{ij}/e_{ij})$ are between 0 and 1, so $0 \le MB(h_{ij}, e_{ij}) \le 1$
and $0 \le MD(h_{ij}, e_{ij}) \le 1$. From this we calculate $0 \le MB(h_{ij}) \le 1$ and $0 \le MD(h_{ij}) \le 1$. In
order to implement the sequential frame confidence scheme outlined before, it is
necessary to define the a priori single frame confidence $P(h_{ij})$, and the conditional single frame
confidence $P(h_{ij}/e_{ij})$, which depends on the evidence for $h_{ij}$ (the hypothesis that in the
i-th frame the tracked object is of type $j$) observed in the i-th frame. These measures
are necessarily dependent on the method used for object segmentation, since evidence for
classification is derived during this process. We have developed ways of computing
$P(h_{ij})$ and $P(h_{ij}/e_{ij})$ for syntactic and statistical classifiers.

Syntactic classification is based upon matching extracted target components to target
models. Importance of components in the target model is given by weights $w_k$, $0 \le w_k \le 1$,
with $\sum_k w_k = 1$. Here, $w_k$ is the significance of matching the k-th component in the target
model. Note that "matching" must reflect the relative geometry of the target
components. Then

\[ C_j = \sum_{\text{matched components}} w_k \; - \sum_{\text{missed components}} w_k \]

defines a preliminary confidence in the partial match of a candidate object to a target
model. If only $m$ of $K$ components in the candidate object were matched in the model, then
it makes sense to reduce confidence in the match proportionately. The confidences $C_j$ are
obtained by updating the initial value $C_j$ as a function of frame number, and
$P(h_{ij}/e_{ij})$ is then obtained by

\[ P(h_{ij}/e_{ij}) = r(C_j) \]

where $r(x)$ is a one-to-one non-linear rescaling function which maps $[-1, 1]$ onto $[0, 1]$ and
also maps $[1/2, 1]$ onto $[1/2, 1]$. Then $e_{ij}$, the evidence in the i-th frame that supports
deciding in favor of target type $T_j$, is given by the match of the extracted components to
those in the target model. Since we have a sequence of frames through which an object is
tracked, we define the a priori confidence in the first frame to be $P(h_{1j}) = 1/(n+1)$
where $n$ object types are possible. The $n+1$ represents the "probability" of "non-target,"
so that the tracked object necessarily matches one of the types. For the second frame,
define $P(h_{2j}) = P(h_{1j}/e_{1j})$. In the i-th frame, $i > 2$, define

\[ P(h_{ij}) = \sum_{k=1}^{i-1} v_{k-1}\, P(h_{kj}/e_{kj}). \]

The $P(h_{ij})$ represents the weighted sum of historic evidence, up to the i-th frame, in
favor of the target being classified as type $T_j$.
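
The single-frame syntactic confidence can be sketched in Python as follows. The paper does not specify the rescaling function r, so the piecewise-linear `rescale` below is a hypothetical choice that merely satisfies the stated properties (one-to-one, maps [-1, 1] onto [0, 1] and [1/2, 1] onto [1/2, 1]). The function names, component weights, and match pattern are likewise hypothetical.

```python
def match_confidence(weights, matched):
    """Preliminary confidence: matched weights minus missed weights, in [-1, 1]."""
    if len(weights) != len(matched):
        raise ValueError("one match flag is required per component weight")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("component weights must sum to 1")
    return sum(w if hit else -w for w, hit in zip(weights, matched))


def rescale(c):
    """Hypothetical r(x): strictly increasing, r(-1) = 0, r(1/2) = 1/2, r(1) = 1."""
    if not -1.0 <= c <= 1.0:
        raise ValueError("confidence must lie in [-1, 1]")
    # Identity on [1/2, 1]; a linear stretch of [-1, 1/2] onto [0, 1/2] below that.
    return c if c >= 0.5 else (c + 1.0) / 3.0


def first_frame_prior(n_types):
    """A priori confidence in the first frame: 1 / (n + 1), allowing for non-target."""
    return 1.0 / (n_types + 1)


# Hypothetical target model with three components; two of them are matched.
weights = [0.5, 0.3, 0.2]
c_good = match_confidence(weights, [True, True, False])
p_good = rescale(c_good)
p_poor = rescale(match_confidence(weights, [True, False, False]))
```

Any other strictly increasing function with the same end points and fixed point at 1/2 could be substituted for `rescale` without changing the rest of the sketch.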

For the case of a statistical classifier, segmentation results in a set of feature
values $(x_1, \ldots, x_q)$ rather than extracted components. A parametric and a non-parametric
scheme for calculating $P(h_{ij}/e_{ij})$ have been developed. They are currently under
experimental evaluation.

Summary

An application of belief measures for decision smoothing over multiple frames has
been presented. The same scheme generalizes to update confidence in any entity tracked
through frames, provided "a priori" and "conditional" confidences can be attached to the
entity at each frame. For instance, appearance of individual scene objects for syntactic
classification is amenable to this process. Each object is matched against target
models, and single frame confidence is derived from the "goodness-of-match." Another
example is confidence in tracked object velocity calculated between frame pairs. Single
frame (pair) confidence is calculated from agreement (or lack of it) between current
frame location and predicted location from the previous frame velocity vector.
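
As a small illustration of the velocity example, the Python sketch below turns the distance between the predicted and the observed location into a single frame-pair confidence. The exponential mapping, the `scale` parameter (in pixels), and the function names are hypothetical; the paper does not state how agreement is converted into a confidence.

```python
import math


def predict_location(location, velocity):
    """Predict the next-frame location from the previous frame velocity vector."""
    return (location[0] + velocity[0], location[1] + velocity[1])


def location_confidence(predicted, observed, scale=5.0):
    """Hypothetical mapping from prediction error (pixels) to a confidence in (0, 1]."""
    error = math.hypot(observed[0] - predicted[0], observed[1] - predicted[1])
    return math.exp(-error / scale)


# Hypothetical track: the object was at (100, 40) moving (3, -1) pixels per frame.
predicted = predict_location((100.0, 40.0), (3.0, -1.0))
conf_close = location_confidence(predicted, (104.0, 39.0))
conf_far = location_confidence(predicted, (115.0, 50.0))
```

The resulting value can serve as the conditional confidence for the frame pair, so that good agreement accrues belief in the velocity estimate and poor agreement accrues disbelief.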

A different set of applications of belief measures to image understanding exists which
are not time dependent. (Setting the time factors for rule update, $\alpha$ and $\beta$, equal to zero
eliminates time dependence in these rules.) Suppose that we wish to make a binary
decision about a property of an object in an image, such as "round or not round," "match
or not match a model or template," "component is merged or not," "component is a fragment
or not," etc. Suppose that we have a set of presumably independent measures, each of
which captures some aspect of the decision, and to each of which is associated a
confidence measure between 0 and 1. For instance, a measure of roundness is given by the
smallest deviation from a circle, by the variance of the set of curvatures calculated at each
point on the perimeter of the object, and also by the ratio of perimeter to pixel area.
For each of these measures, a normalized scale of distance of the measure from that
produced by a circle can be calculated, yielding a confidence measure. The problem is to
accrue the strength of these various confidences to decide the total confidence in the
decision "circle or not circle." A solution is to order the measures arbitrarily and treat
them as sequentially obtained information (even though they can be obtained in
parallel). Then the scheme outlined in this paper (with time constants $\alpha$ and $\beta$ set to
zero) can be applied to yield a single confidence measure. This measure can then be
thresholded to determine the binary decision.

It is also possible to consider more complex combining schemes if more information,
such as degree of dependence of measures, or independent confirmation of a measure, is
available.
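
A Python sketch of this use is given below. It does not reproduce the time-dependent rules of this paper; instead it assumes the standard time-independent combining rule of Shortliffe and Buchanan, MB = MB1 + MB2 (1 - MB1), and likewise for MD. The neutral prior of 0.5 used to split each confidence into belief or disbelief, the decision threshold, the function name, and the three roundness confidences are all hypothetical.

```python
def combine_confidences(confidences, prior=0.5, threshold=0.2):
    """Accrue independent confidences in [0, 1] into one CF and a binary decision."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    mb = md = 0.0
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError("each confidence must lie in [0, 1]")
        if c >= prior:
            # Confidence above the prior adds to belief.
            mb += (c - prior) / (1.0 - prior) * (1.0 - mb)
        else:
            # Confidence below the prior adds to disbelief.
            md += (prior - c) / prior * (1.0 - md)
    cf = mb - md
    return cf, cf >= threshold


# Hypothetical roundness confidences from three measures: fit to a circle,
# curvature variance, and perimeter-to-area ratio.
cf, is_round = combine_confidences([0.9, 0.7, 0.4])
cf_reordered, _ = combine_confidences([0.4, 0.9, 0.7])
```

Under this combining rule the accrued value does not depend on the order in which the measures are presented, which is consistent with ordering them arbitrarily.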